The Performance Paradox states that a mathematically perfect kernel (e.g., $out = x + y$) may actually perform worse than a CPU loop if it cannot amortize the GPU hardware's fixed costs. This typically shows up as launch overhead.
1. The "Correctness" Fallacy
Functional correctness is not the same as efficiency. Your Triton code may correctly distribute the task across thousands of threads, but if the total workload (N) is small, the GPU is severely underutilized: the hardware spends far more time switching state than performing actual arithmetic.
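The imbalance between arithmetic and data movement can be made concrete with standard back-of-envelope FLOP and byte counts for float32 (this is a sketch of the usual accounting, not profiler output):

```python
def vector_add_intensity(n: int) -> float:
    """FLOPs per byte for out = x + y on n float32 elements."""
    flops = n                    # one addition per element
    bytes_moved = 3 * 4 * n      # two loads + one store, 4 bytes each
    return flops / bytes_moved   # 1/12 FLOP/byte, regardless of n

def matmul_intensity(n: int) -> float:
    """FLOPs per byte for a dense n x n float32 matrix multiply."""
    flops = 2 * n ** 3           # n^2 outputs, each a length-n dot product
    bytes_moved = 3 * 4 * n ** 2 # read A and B, write C
    return flops / bytes_moved   # n/6, grows with problem size
```

Vector addition stays at 1/12 FLOP per byte no matter how large N gets, so it can only ever be bandwidth-bound; a dense matrix multiply's intensity grows linearly with n, which is why it eventually becomes compute-bound.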
2. The Python Measurement Trap
Benchmarking GPU code from Python with time.time() is risky. GPU calls are asynchronous: Python merely enqueues the command and moves on. Without torch.cuda.synchronize(), you are measuring enqueue time. With synchronization, what you measure is dominated by host-to-device latency, which is often 10x longer than the kernel execution itself.
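The pitfall can be avoided with a small timing helper. This is a sketch, not a library API: `benchmark` and its parameters are hypothetical names, and `synchronize` is whatever barrier your framework provides (for PyTorch on CUDA, that would be torch.cuda.synchronize):

```python
import time

def benchmark(fn, synchronize=None, warmup=3, iters=10):
    """Average wall time of fn(), with optional device barriers.

    Without `synchronize`, an asynchronous launch returns immediately,
    so this would time the enqueue, not the kernel.
    """
    for _ in range(warmup):
        fn()                      # warm up caches / JIT compilation
    if synchronize is not None:
        synchronize()             # drain any queued warmup work
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if synchronize is not None:
        synchronize()             # wait for the timed work to finish
    return (time.perf_counter() - start) / iters
```

A hypothetical call site would look like `benchmark(lambda: add_kernel[grid](x, y, out, N), synchronize=torch.cuda.synchronize)`, where `add_kernel` and `grid` stand in for your own Triton kernel and launch grid.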
3. Latency vs. Throughput
To overcome the paradox, you must supply enough work to "hide" the launch latency. This is precisely the shift from a latency-bound regime (limited by the CPU-GPU bus) to a throughput-bound regime (limited by GPU memory or compute).
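A toy cost model shows the transition between the two regimes. The 43 µs launch overhead and 900 GB/s bandwidth below are illustrative assumptions, not measurements:

```python
LAUNCH_OVERHEAD_US = 43.0   # assumed fixed cost per launch, in microseconds
BANDWIDTH_GB_S = 900.0      # assumed sustained HBM bandwidth

def vector_add_total_us(n: int) -> float:
    """Modeled wall time for out = x + y on n float32 elements."""
    bytes_moved = 3 * 4 * n                           # 2 loads + 1 store
    kernel_us = bytes_moved / (BANDWIDTH_GB_S * 1e3)  # GB/s -> bytes/us
    return LAUNCH_OVERHEAD_US + kernel_us

def overhead_fraction(n: int) -> float:
    """Share of total time spent on the fixed launch cost."""
    return LAUNCH_OVERHEAD_US / vector_add_total_us(n)
```

Under these assumptions, at N=256 the fixed cost is essentially 100% of the runtime (latency-bound), while at N=10^8 it falls to a few percent (throughput-bound).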
QUESTION 1
For each kernel, decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead: Vector addition (N=256), Vector addition (N=10^8), and Matrix Multiplication (N=8192).
N=256: Arithmetic; N=10^8: Bandwidth; MM: Launch
N=256: Launch; N=10^8: Bandwidth; MM: Arithmetic
N=256: Bandwidth; N=10^8: Arithmetic; MM: Launch
All are compute-bound.
✅ Correct!
At very small N, launch overhead dominates. Large vector adds are memory-bandwidth limited. Dense matrix multiplications have high arithmetic intensity and become compute-bound.
❌ Incorrect
Think about the ratio of math to data movement, and the constant cost of starting a kernel.
QUESTION 2
In the context of the Performance Paradox, what is the primary bottleneck for a 'ReLU on a matrix' operation?
Arithmetic Throughput
Memory Bandwidth
Register Pressure
L1 Cache Size
✅ Correct!
ReLU is memory-bound. It performs one very simple comparison (max(0, x)) for every load and store, resulting in extremely low arithmetic intensity.
❌ Incorrect
Does ReLU perform complex math, or does it spend most of its time moving data to and from HBM?
QUESTION 3
What does the term 'Asynchronous Execution' imply regarding GPU benchmarking?
The GPU and CPU always finish at the same time.
The CPU continues to the next line of code before the GPU kernel finishes.
The kernel runs faster on smaller GPUs.
Memory transfers are blocked by compute.
✅ Correct!
This is why synchronization is required for accurate timing; otherwise, you just time how long it took to send the command.
❌ Incorrect
If the CPU waited for every GPU call, performance would be significantly worse due to constant idle cycles.
QUESTION 4
Why does $out = x + y$ exhibit low arithmetic intensity?
It uses three memory accesses (2 loads, 1 store) for a single floating-point operation.
The addition operation is too complex for the ALUs.
It requires shared memory synchronization.
It only runs on one SM.
✅ Correct!
High-performance compute requires many FLOPs per byte moved. Vector add is the opposite, making it bandwidth-limited.
❌ Incorrect
Count the number of times you access memory (tl.load/tl.store) versus the number of math operations (+).
QUESTION 5
How can the 'Launch Tax' be amortized in a real-world application?
By calling the kernel more frequently with smaller data.
By increasing the workload per launch (e.g., larger N or batching).
By using 16-bit floats instead of 32-bit floats.
By disabling the L2 cache.
✅ Correct!
Increasing the workload makes the fixed overhead a smaller percentage of the total execution time.
❌ Incorrect
Smaller data sizes actually make the launch tax more prominent relative to the useful work.
Case Study: The Overhead Audit
Interpreting Host vs. Device Benchmarks
A developer runs a Triton kernel for Vector Addition on 512 elements. They measure 45 microseconds using Python's `time.time()`. When profiling the same kernel using NVIDIA Nsight Systems, the actual GPU duration is reported as only 2.1 microseconds.
Q
1. What is the approximate 'Launch Tax' in microseconds for this scenario, and what percentage of the total measured time does it represent?
Solution:
The Launch Tax is approximately 42.9 microseconds (45 µs total - 2.1 µs of GPU work). This represents ~95.3% of the total measured time, indicating the application is heavily bound by system overhead rather than computation.
Q
2. If the developer increases N to 1,000,000 elements, assuming the kernel now takes 150 microseconds on the GPU, how does the Launch Tax impact the overall efficiency?
Solution:
With a constant launch overhead of ~43us, the total time would be ~193us. The overhead now only accounts for ~22.3% of the time. Efficiency improves as N increases because the fixed cost is spread over a much larger volume of compute/memory work.
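The arithmetic behind both answers can be checked in a few lines, using the numbers from the scenario (the solution's ~22.3% comes from rounding the overhead up to 43 µs; the unrounded figure lands at ~22.2%):

```python
host_us, device_us = 45.0, 2.1    # time.time() vs Nsight-reported duration
launch_tax = host_us - device_us  # fixed overhead: 42.9 us

small_share = launch_tax / host_us        # share of total at N=512
big_total = launch_tax + 150.0            # modeled total at N=1,000,000
big_share = launch_tax / big_total        # share of total at N=1,000,000

print(f"launch tax: {launch_tax:.1f} us "
      f"({small_share:.1%} of total at N=512, "
      f"{big_share:.1%} at N=1,000,000)")
```

The fixed cost does not shrink as N grows; it simply becomes a smaller slice of a much larger total, which is the whole point of amortization.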